Master's Programme in Computational Linguistics
Course: Introduction to Computational Linguistics
Lecture 2: Methods and technologies applied to text. Sub-syntactic and syntactic processing
Lectures: Dan Cristea
Seminars & project: Mihaela Onofrei, Dan Cristea

POS Tagger Prerequisites
• Lexicon of words
• For each word in the lexicon: information about all its possible tags according to a chosen tagset
• Different methods for choosing the correct tag for a word:
  – Rule-based methods
  – Statistical methods
  – Transformation-based tagging
  – Neural methods

POS Tagger Prerequisites: Lexicon of words
• Classes of words
  – Closed classes: a fixed set
    • Prepositions: in, by, at, of, …
    • Pronouns: I, you, he, her, them, …
    • Particles: on, off, …
    • Determiners: the, a, an, …
    • Conjunctions: or, and, but, …
    • Auxiliary verbs: can, may, should, …
    • Numerals: one, two, three, …
  – Open classes: new members are created all the time, so the lexicon cannot contain every word of these classes
    • Nouns
    • Verbs
    • Adjectives
    • Adverbs

POS Tagger Prerequisites: Tagsets
• POS tagging requires choosing a standard set of tags to work with
• A tagset is normally sophisticated and linguistically well grounded
• One could pick a very coarse tagset: N, V, Adj, Adv
• More commonly used sets are finer grained, e.g. the “UPenn TreeBank tagset”, with 48 tags
• Even more fine-grained tagsets exist

POS Tagger Prerequisites: Tagset example – UPenn tagset
 #   Tag    Description
 1   CC     Coordinating conjunction
 2   CD     Cardinal number
 3   DT     Determiner
 4   EX     Existential there
 5   FW     Foreign word
 6   IN     Preposition/subordinating conjunction
 7   JJ     Adjective
 8   JJR    Adjective, comparative
 9   JJS    Adjective, superlative
10   LS     List item marker
11   MD     Modal
12   NN     Noun, singular or mass
13   NNS    Noun, plural
14   NNP    Proper noun, singular
15   NNPS   Proper noun, plural
16   PDT    Predeterminer
17   POS    Possessive ending
18   PRP    Personal pronoun
19   PRP$   Possessive pronoun
20   RB     Adverb
21   RBR    Adverb, comparative
22   RBS    Adverb, superlative
23   RP     Particle
24   SYM    Symbol (mathematical or scientific)
25   TO     to
26   UH     Interjection
27   VB     Verb, base form
28   VBD    Verb, past tense
29   VBG    Verb, gerund/present participle
30   VBN    Verb, past participle
31   VBP    Verb, non-3rd person singular present
32   VBZ    Verb, 3rd person singular present
33   WDT    wh-determiner
34   WP     wh-pronoun
35   WP$    Possessive wh-pronoun
36   WRB    wh-adverb
37   #      Pound sign
38   $      Dollar sign
39   .      Sentence-final punctuation
40   ,      Comma
41   :      Colon, semi-colon
42   (      Left bracket character
43   )      Right bracket character
44   "      Straight double quote
45   `      Left open single quote
46   ``     Left open double quote
47   '      Right close single quote
48   ''     Right close double quote

POS Tagging: Rule-based methods
• Start with a dictionary
• Assign all possible tags to words from the dictionary
• Write rules by hand to selectively remove tags
• The rules leave only the correct tag for each word

POS Tagging: Statistical methods (1) – The Most Frequent Tag Algorithm
• Training
  – Take a tagged corpus
  – Create a dictionary containing every word in the corpus together with all its possible tags
  – Count the number of times each tag occurs for a word and compute the probability P(tag|word); then save all probabilities
• Tagging
  – Given a new sentence, for each word, pick the tag that was most frequent for that word in the corpus

POS Tagging: Statistical methods (2) – Bigram HMM Tagger
• Training
  – Create a dictionary containing every word in the corpus together with all its possible tags
  – Estimate the probability of each word given a tag, P(word|tag), and the probability of each tag given the preceding tag, P(tag|previous tag) (bigram HMM tagger => the tag probability depends only on the previous tag)
• Tagging
  – Given a new sentence, for each word, pick the most likely tag for that word using the parameters obtained in training
  – HMM taggers choose the tag sequence that maximizes the product, over the words of the sentence, of P(word|tag) * P(tag|previous tag)

Bigram HMM Tagging: Example
People/NNS are/VBP expected/VBN to/TO queue/VB at/IN the/DT registry/NN
The/DT police/NN is/VBZ to/TO blame/VB for/IN the/DT queue/NN
• to/TO queue/???   the/DT queue/???
• => max_k P(t_k | t_{k-1}) * P(w_i | t_k)
  – w_i = the current word in the sequence, t_k = one possible tag for the current word, t_{k-1} = the tag of the previous word
• How do we compute P(t_k | t_{k-1})?
  – count(t_{k-1} t_k) / count(t_{k-1})
• How do we compute P(w_i | t_k)?
  – count(w_i, t_k) / count(t_k)
• max[P(VB|TO) * P(queue|VB), P(NN|TO) * P(queue|NN)]
• Corpus:
  – P(NN|TO) = 0.021, P(queue|NN) = 0.00041 => product ≈ 0.000009
  – P(VB|TO) = 0.34, P(queue|VB) = 0.00003 => product ≈ 0.00001
  – The VB product is larger, so queue is tagged VB after to/TO

POS Tagging: Transformation-Based Tagging
• Basic idea:
  – Set the most probable tag for each word as a start value
  – Change tags, in a specific order, according to rules of the type “if word-1 is a determiner and word is a verb then change the tag to noun”
• Training is done on a tagged corpus:
  1. Write a set of rule templates
  2. Among the set of rules, find the one with the highest score
  3. Repeat from step 2 until the best score falls below a threshold
  4. Keep the ordered set of rules
• Rules make errors that are corrected by later rules

Transformation-Based Tagging: Example
• Tag every word with its most likely tag
  – For example, race has the following probabilities in the Brown corpus:
    • P(NN|race) = 0.98
    • P(VB|race) = 0.02
• Transformation rules make changes to tags
  – “Change NN to VB when the previous tag is TO”
    … is/VBZ expected/VBN to/TO race/NN tomorrow/NN
    becomes
    … is/VBZ expected/VBN to/TO race/VB tomorrow/NN

POS Taggers (1)
ACOPOST
Author(s): Jochen Hagenstroem, Kilian Foth, Ingo Schröder, Parantu Shah
Purpose: ACOPOST is a collection of POS taggers. It implements and extends well-known machine learning techniques and provides a uniform environment for testing.
Platforms: All POSIX (Linux/BSD/UNIX-like OSes)
Access: Free at http://sourceforge.net/projects/acopost/

BRILL'S TAGGER
Author(s): Eric Brill
Purpose: Transformation-based learning POS tagger
Access: Free at http://www.cs.jhu.edu/~brill

fnTBL
Author(s): Radu Florian and Grace Ngai, Johns Hopkins University, USA
Purpose: fnTBL is a customizable, portable and free source machine-learning toolkit primarily oriented towards Natural Language-related tasks (POS tagging, base NP chunking, text chunking, end-of-sentence detection). It is currently trained for English and Swedish.
Platforms: Linux, Solaris, Windows
Access: Free at http://nlp.cs.jhu.edu/~rflorian/fntbl/

POS Taggers (2)
LINGSOFT
Author(s): LINGSOFT, Finland
Purpose: Among the services offered by Lingsoft one can find POS taggers for Danish, English, German, Norwegian, Swedish.
Access: Not free. Demos at http://www.lingsoft.fi/demos.html

LT POS (LT TTT)
Author(s): Language Technology Group, University of Edinburgh, UK
Purpose: The LT POS part-of-speech tagger uses a Hidden Markov Model disambiguation strategy. It is currently trained only for English.
Access: Free but requires registration at http://www.ltg.ed.ac.uk/software/pos/index.html

MACHINESE PHRASE TAGGER
Author(s): Connexor
Purpose: Machinese Phrase Tagger is a set of program components that perform basic linguistic analysis tasks at very high speed and provide relevant information about words and concepts to volume-intensive applications. Available for: English, French, Spanish, German, Dutch, Italian, Finnish.
Access: Not free. Free access to an online demo at http://www.connexor.com/demo/tagger/

POS Taggers (3)
MXPOST
Author(s): Adwait Ratnaparkhi
Purpose: MXPOST is a maximum entropy POS tagger. The downloadable version includes a Wall St. Journal tagging model for English, but can also be trained for different languages.
Platforms: Platform independent
Access: Free at http://www.cis.upenn.edu/~adwait/statnlp.html

MEMORY BASED TAGGER
Author(s): ILK - Tilburg University, CNTS - University of Antwerp
Purpose: Memory-based tagging is based on the idea that words occurring in similar contexts will have the same POS tag. The idea is implemented using the memory-based learning software package TiMBL.
Access: Usable by email or on the Web at http://ilk.uvt.nl/software.html#mbt

µ-TBL
Author(s): Torbjörn Lager
Purpose: The µ-TBL system is a powerful environment in which to experiment with transformation-based learning.
Platforms: Windows
Access: Free at http://www.ling.gu.se/~lager/mutbl.html

POS Taggers (4)
QTAG
Author(s): Oliver Mason, Birmingham University, UK
Purpose: QTag is a probabilistic part-of-speech tagger. Resource files for English and German can be downloaded together with the tool.
Platforms: Platform independent
Access: Free at http://www.english.bham.ac.uk/staff/omason/software/qtag.html

STANFORD POS TAGGER
Author(s): Kristina Toutanova, Stanford University, USA
Purpose: The Stanford POS tagger is a log-linear tagger written in Java. The downloadable package includes components for command-line invocation and a Java API, both for training and for running a trained tagger.
Platforms: Platform independent
Access: Free at http://nlp.stanford.edu/software/tagger.shtml

SVM TOOL
Author(s): TALP Research Center, Technical University of Catalonia (UPC), Spain
Purpose: The SVMTool is a simple and effective part-of-speech tagger based on Support Vector Machines. The SVMLight implementation of Vapnik's Support Vector Machine by Thorsten Joachims has been used to train the models for Catalan, English and Spanish.
Access: Free. SVMTool at http://www.lsi.upc.es/~nlp/SVMTool/ and SVMLight at http://svmlight.joachims.org/

POS Taggers (5)
TnT
Author(s): Thorsten Brants, Saarland University, Germany
Purpose: TnT, the short form of Trigrams'n'Tags, is a very efficient statistical part-of-speech tagger that is trainable on different languages and virtually any tagset. The tagger is an implementation of the Viterbi algorithm for second-order Markov models. TnT comes with two language models, one for German and one for English.
Platforms: Platform independent
Access: Free but requires registration at http://www.coli.uni-saarland.de/~thorsten/tnt/

TREETAGGER
Author(s): Helmut Schmid, Institute for Computational Linguistics, University of Stuttgart, Germany
Purpose: The TreeTagger has been successfully used to tag German, English, French, Italian, Spanish, Greek and Old French texts and is easily adaptable to other languages if a lexicon and a manually tagged training corpus are available.
Access: Free at http://www.ims.uni-stuttgart.de/projekte/corplex/TreeTagger/DecisionTreeTagger.html

POS Taggers (6)
Xerox XRCE MLTT Part Of Speech Taggers
Author(s): Xerox Research Centre Europe
Purpose: Xerox has developed morphological analysers and part-of-speech disambiguators for various languages including Dutch, English, French, German, Italian, Portuguese, Spanish. More recent developments include Czech, Hungarian, Polish and Russian.
Access: Not free. Demos at http://www.xrce.xerox.com/competencies/content-analysis/fsnlp/tagger.en.html

YAMCHA
Author(s): Taku Kudo
Purpose: YamCha is a generic, customizable, and open source text chunker oriented toward many NLP tasks, such as POS tagging, Named Entity Recognition, base NP chunking, and text chunking. YamCha uses Support Vector Machines (SVMs), first introduced by Vapnik in 1995. YamCha is exactly the same system that performed best in the CoNLL2000 Shared Task, Chunking and BaseNP Chunking task.
Platforms: Linux, Windows
Access: Free at http://www2.chasen.org/~taku/software/yamcha/

Stemming
• Stemmers are used in IR to reduce as many related words and word forms as possible to a common canonical form (not necessarily the base form), which can then be used in the retrieval process
• Frequently, the performance of an IR system will be improved if term groups such as CONNECT, CONNECTED, CONNECTING, CONNECTION, CONNECTIONS are conflated into a single term (by removal of the various suffixes -ED, -ING, -ION, -IONS to leave the single term CONNECT). The suffix stripping process will reduce the total number of terms in the IR system, and hence reduce the size and complexity of the data in the system, which is always advantageous

Stemmers (1)
ELLOGON
Author(s): George Petasis, Vangelis Karkaletsis, National Center for Scientific Research, Greece
Access: Free at http://www.ellogon.org/

FSA
Author(s): Jan Daciuk, Rijksuniversiteit Groningen and Technical University of Gdansk, Poland
Purpose: Supported languages: German, English, French, Polish
Access: Free at http://juggernaut.eti.pg.gda.pl/~jandac/fsa.html

HEART OF GOLD
Author(s): Ulrich Schäfer, DFKI Language Technology Lab, Germany
Purpose: Supported languages: Unicode, Spanish, Polish, Norwegian, Japanese, Italian, Greek, German, French, English, Chinese
Access: Free at http://heartofgold.dfki.de/

Stemmers (2)
LANGSUITE
Author(s): PetaMem
Purpose: Supported languages: Unicode, Spanish, Polish, Italian, Hungarian, German, French, English, Dutch, Danish, Czech
Access: Not free. More information at http://www.petamem.com/

SNOWBALL
Purpose: Presentation of stemming algorithms, and Snowball stemmers, for English, Russian, Romance languages (French, Spanish, Portuguese and Italian), German, Dutch, Swedish, Norwegian, Danish and Finnish
Access: Free at http://www.snowball.tartarus.org/

SProUT
Author(s): Feiyu Xu, Tim vor der Brück, LT-Lab, DFKI GmbH, Germany
Purpose: Available for: Unicode, Spanish, Japanese, German, French, English, Chinese
Access: Not free. More information at http://sprout.dfki.de/

TWOL
Author(s): Lingsoft
Purpose: Supported languages: Swedish, Norwegian, German, Finnish, English, Danish
Access: Not free. More information at http://www.lingsoft.fi/

Lemmatization
• The process of grouping the inflected forms of a word together under a base form, or of recovering the base form from an inflected form, e.g. grouping the inflected forms COME, COMES, COMING, CAME under the base form COME
• Dictionary based
  – Input: token + POS
  – Output: lemma
• Note: needs POS information
• Example:
  – left+v -> leave, left+a -> left
• It is the same as looking for a transformation to apply to a word to get its normalized form (word endings: which suffix should be removed and/or added to get the normalized form) => lemmatization can be modeled as a machine learning problem

Lemmatizers (1)
CONNEXOR LANGUAGE ANALYSIS TOOLS
Author(s): Connexor, Finland
Purpose: Supported languages: English, French, Spanish, German, Dutch, Italian, Finnish
Access: Not free. Demos at http://www.conexor.fi/

ELLOGON
Author(s): George Petasis, Vangelis Karkaletsis, National Center for Scientific Research, Greece
Access: Free at http://www.ellogon.org/

FSA
Author(s): Jan Daciuk, Rijksuniversiteit Groningen and Technical University of Gdansk, Poland
Purpose: Supported languages: German, English, French, Polish
Access: Free at http://juggernaut.eti.pg.gda.pl/~jandac/fsa.html

MBLEM
Author(s): ILK Research Group, Tilburg University
Purpose: MBLEM is a lemmatizer for English, German, and Dutch
Access: Demo at http://ilk.uvt.nl/mblem/

Lemmatizers (2)
SWESUM
Author(s): Hercules Dalianis, Martin Hassel, KTH, Euroling AB
Purpose: Supported languages: Swedish, Spanish, German, French, English
Access: Free at http://www.euroling.se/produkter/swesum.html

TREETAGGER
Author(s): Helmut Schmid, Institute for Computational Linguistics, University of Stuttgart, Germany
Purpose: The TreeTagger has been successfully used for German, English, French, Italian, Spanish, Greek and Old French texts and is easily adaptable to other languages if a lexicon is available.
Access: Free at http://www.ims.uni-stuttgart.de/projekte/corplex/TreeTagger/DecisionTreeTagger.html

TWOL
Author(s): Lingsoft
Purpose: Supported languages: Swedish, Norwegian, German, Finnish, English, Danish
Access: Not free. More information at http://www.lingsoft.fi/

Shallow Parsing (chunking)
• Partition the input into sequences of non-overlapping units, or chunks: sequences of words labelled with syntactic categories and possibly a marking to indicate which word is the head of the chunk
• How?
  – A set of regular expressions over POS labels
  – Training the chunker on manually marked-up text

Nominal phrases
Solicitat să comenteze [un editorial recent al lui [Dinu Patriciu]], în [care] [acesta] preciza că nu crede în [social-liberalism] şi să aprecieze dacă, astfel, a dat [o lovitură de [imagine]] [USL], [Antonescu] a spus că nu ştie dacă [Patriciu] s-a referit la [USL]
(English gloss: Asked to comment on [a recent editorial by [Dinu Patriciu]], in [which] [he] stated that he does not believe in [social-liberalism], and to say whether he had thereby dealt [an [image] blow] to [USL], [Antonescu] said he does not know whether [Patriciu] was referring to [USL])

Noun Phrase (NP) Chunkers
fnTBL
Author(s): Radu Florian and Grace Ngai, Johns Hopkins University, USA
Purpose: fnTBL is a customizable, portable and free source machine-learning toolkit primarily oriented towards Natural Language-related tasks (POS tagging, base NP chunking, text chunking, end-of-sentence detection, word sense disambiguation). It is currently trained for English and Swedish.
Platforms: Linux, Solaris, Windows
Access: Free at http://nlp.cs.jhu.edu/~rflorian/fntbl/

YAMCHA
Author(s): Taku Kudo
Purpose: YamCha is a generic, customizable, and open source text chunker oriented toward many NLP tasks, such as POS tagging, Named Entity Recognition, base NP chunking, and text chunking. YamCha uses Support Vector Machines (SVMs), first introduced by Vapnik in 1995. YamCha is exactly the same system that performed best in the CoNLL2000 Shared Task, Chunking and BaseNP Chunking task.
Platforms: Linux, Windows
Access: Free at http://www2.chasen.org/~taku/software/yamcha/

Named Entity Recognition
• Identification of proper names in texts, and their classification into a set of predefined categories of interest:
  – entities: organizations, persons, locations
  – temporal expressions: time, date
  – quantities: monetary values, percentages, numbers
• Two kinds of approaches:
  Knowledge Engineering
  • rule based
  • developed by experienced language engineers
  • makes use of human intuition
  • needs only a small amount of training data
  • development is very time consuming
  • some changes may be hard to accommodate
  Learning Systems
  • use statistics or other machine learning
  • developers do not need language engineering (LE) expertise
  • require large amounts of annotated training data
  • some changes may require re-annotation of the entire training corpus

Named Entity Recognition: Knowledge engineering approach
• Identification of named entities in two steps:
  – recognition patterns expressed as WFSA (Weighted Finite-State Automata) are used to identify phrases containing potential candidates for named entities (longest match strategy)
  – additional constraints (depending on the type of candidate) are used for validating the candidates
• Usage of an on-line base lexicon for geographical names and first names

Named Entity Recognition: Problems
• Variation of NEs, e.g. John Smith, Mr. Smith, John
• Since named entities (companies, persons) may appear without designators, a dynamic lexicon is used for storing such named entities
  Example: “Mars Ltd. is a wholly-owned subsidiary of Food Manufacturing Ltd., a non-trading company registered in England. Mars is controlled by members of the Mars family.”
• Resolution of type ambiguity using the dynamic lexicon: if an expression can be a person name or a company name (Martin Marietta Corp.), then use the type of the last entry inserted into the dynamic lexicon to make the decision
• Issues of style, structure, domain, genre, etc.
• Punctuation, spelling, spacing, formatting

Examples of entity mentions
• nested (imbricated) entities
  clădirea Universității din Iași (“the building of the University of Iași”) => [clădirea [Universității din [Iași]NP3,NE2] NP2,NE1] NP1

Examples of entity mentions
• juxtaposed entities
  Western and Central Europe => [[Western T01] NE1/T04 and T02 [Central T03 Europe T04] NE2] NP1

Graphical Grammar Studio (GGS)
• Framework for the development and processing of grammars
• It incorporates a constraint description language:
  – implementation of composite features
  – look-ahead and look-behind assertions
  – priority scores can be placed on arcs => forces a preference order in processing paths
• Designed with the main purpose of performing syntactic and sub-syntactic analysis

GGS networks
• Consume and annotate sequences of tokens or other XML elements. The input tokens can include any number of associated attributes (usually denoting POS, lemma, articles in the case of nouns and adjectives, token IDs, etc.)
• Structured as directed graphs: nodes express token-consuming conditions, edges are directed:
  – some nodes can make jumps to other sub-graphs
  – networks are meant to be integrated into NLP chains
  – a GGS network is basically a finite state machine whose nodes can be associated with states

GGS networks
Bărăganul este cea mai mică câmpie (“Bărăganul is the smallest plain”) => path 5 => Bărăganul

GGS networks
oceanele, de la mare la mic, sunt după cum urmează: Pacific, Atlantic, Indian, Antarctic și Arctic (“the oceans, from largest to smallest, are as follows: Pacific, Atlantic, Indian, Antarctic and Arctic”) => Pacific Atlantic Indian Antarctic Arctic

Noun phrase detection
Dr. Radu Simionescu, 2013

MappingBooks: Let's jump out of the book into the real world!
• First place at the BringITon!-2016 exhibition of IT students' creations
• A UEFISCDI project that ran between 2014 and 2016:
  – partners: UAIC-FII (coordinator), “Ștefan cel Mare” University of Suceava, Siveco – Bucharest

MappingBooks: what is it about?
Creating a more intimate link between the book and its reader
• Recognise mentions of locations in text
• Crawl the web for supplementary information
• Know where the reader is
• Point out entities mentioned in the text that are in the reader's proximity
• Trace them on maps
• Mix images with generated info

Linguistic Linked Open Data (LLOD)
a subfield of Natural Language Processing

I like to read books and to travel…

Going out of the book…
[Figure: map with walking directions from Katip Çelebi Mh., Maç Sk. to Çukur Cuma Cd., Beyoğlu, Turkey (400 m, about 4 minutes); map data ©2013 Basarsoft]

I need help to remember all kinship relations between characters

Characters in Forsyte Saga
The old Forsytes
• Ann, the eldest of the family
• Old Jolyon, the patriarch of the family, having made a fortune in tea
• James, a solicitor, married to Emily, a most tranquil woman
• Swithin, James's twin brother with aristocratic pretensions; a bachelor
• Roger, "the original Forsyte"
• Julia (Juley), a fluttery dowager; Mrs Septimus Small
• Hester, an old maid
• Nicholas, the wealthiest in the family
• Timothy, the most cautious man in England
• Susan, the married sister
The young Forsytes
• Young Jolyon, Old Jolyon's artistic and free-thinking son, married three times
• Soames, James and Emily's son, an intense, unimaginative and possessive solicitor, married to the unhappy Irene, who later marries Young Jolyon
• Winifred, Soames's sister, one of the three daughters of James and Emily, married to the foppish and lethargic Montague Dartie
• George, Roger's son, a dyed-in-the-wool mocker
• Francie, George's sister and Roger's daughter, emancipated from God
Their children
• June, Young Jolyon's defiant daughter from his first marriage; engaged to an architect, Philip Bosinney, who becomes Irene's lover
• Jolly, Young Jolyon's son from his second marriage; dies of enteric fever during the Boer Wars
• Holly, Young Jolyon's daughter from his second marriage, to June's governess
• Jon, Young Jolyon's son from his third marriage, to Irene, Soames's first wife
• Fleur, Soames's daughter from his second marriage, to a French Soho shopgirl, Annette; Jon's lover; later marries a baronet, Michael Mont
• Val, Winifred and Montague's son; fights in the Boer Wars; marries his cousin Holly
• Imogen, Winifred and Montague's daughter
Others
• Parfitt, Old Jolyon's butler
• Smither, Aunts Ann, Juley and Hester's housekeeper
• Warmson, James and Emily's butler
• Bilson, Soames's housemaid
• Prosper Profond, Winifred's admirer and Annette's lover

MappingBooks
• A MappedBook is a book connected with locations/events in the virtual and real world and sensitive to the instantaneous location of a reader (as detected by the mobile phone or tablet)
• The information made available may differ depending on the moment and the place of the reader

MappingBooks
• multi-dimensional mash-ups combining textual, geographical and temporal data
• adequate presentation to the reader
• links sensitive to:
  – the context of the mentions in the book
  – the moment the user initiates an access
  – the current location of the user
• make heavy use of entity linking techniques
• spot the book mentions (persons and locations) in the real and virtual world

Aims
1) connect mentions of entities in the form of nominals (noun phrases) => one coreferential chain corresponds to each entity;
2) no preliminary records about linked entities => the knowledge base evolves from scratch;
3) look especially for coreferential relations (identity of entity mentions) and geographical relations (position, distance, point-of, near, intersects, etc.);
4) texts under investigation: geography textbooks and travel guides

MappingBooks: what is it about?
• “Understand” parts of a text
• Recognise mentions of persons and locations
• Recognise and crawl for real-world entities
• Know where I am
• Detect which real-world entities are in my proximity
• Trace Google Maps paths, as described in the book
• Fetch, process and make use of geo-data
• Mix images with generated info
• Display an attractive user interface
• Client-server

MappingBooks – an architecture

Towards… live books
• Multidimensional artefacts that combine textual, geographical, temporal, etc. data
• Highlight mentions of persons, locations…
• Links sensitive to:
  – the context of the mention in the book
  – the location of the reader
  – the moment of reading
  – the personality and preferences of the reader

Usage examples
– I visit a city with the travel guide in my hand
– places of interest and routes are reordered depending on my instantaneous position

Usage examples
– I am a schoolboy, on the train going from Brașov to Sibiu…
– if I open my tablet and point it towards the left-side window of the train, I will see arrows showing the peaks of the Făgăraș mountains, exactly as in the geography textbook

Usage examples
– I am in Paris for the 3rd time…
– but only now does my MB Lonely Planet guide alert me to this temporary exhibition open in the Pyramid

MappingBooks is addressed to…
• Youngsters, school people – could we bring the book back into their hands?
• Teenagers, adventurers, travelers, lovers of excursions – socialize about places mentioned in guides
• Pensioners – socialize about common readings, cultural preferences
• Researchers in Language Technology & Computational Linguistics – access to annotated linguistic resources
• Owners of textual data (publishing houses, media companies, tabloids) – sell their products better
• Local administration, tourist agencies – advertise places of local touristic interest

The application
1) Connects mentions of entities (nominal groups) => one entity = a chain of coreferential mentions
2) The knowledge base does not include any a priori records about entities => starts from scratch
3) Identifies geographical relations (distances, positions, proximities, intersections, etc.)
4) Texts, for the time being: geography textbooks

Instantaneous contexts
(volatile data collected during daily or sporadic use, quickly changing)
• information detected by sensors or inferred:
  – the place where you are now, recent travels, the history of past travels
• information accumulated on the server:
  – the book that you read now, recent readings, the history of your readings
• information entered by the user:
  – intention to start a journey

Volatile profiling
• Contextual information => builds specific volatile “signatures” of users, as attribute-value vectors

Mixed profile
• Sums up the two types of data used in profiling

Accessibility of personal data
• In line with the typology of data accessibility adopted by current social media networks:
  – public
  – restricted to classes of friends
  – for personal use only

Matching
• Apply scored vector matching techniques against other users' signatures
• Retain the best ranked matches

Networking Readers: enhance e-Books reading experience
• Use semantic and geographical links to form communities
  – if books “subscribed to” are declared visible =>
    • current co-readers of B
    • co-readers of B
  – if “instantaneous or past location” is declared visible =>
    • current co-proximity of L
    • co-proximity of L
    • current co-track of T
    • co-track of T
  – any combinations

Networking Readers: enhance e-Books reading experience
• Easy to imagine other ways to form communities rooted in readings
  – intersect common readings and visited places with levels of friendship reported by other social media, like Facebook or Twitter
  – real-world events and entities mentioned in a book, associated with real-world locations and particular moments of the year/day

MappingBooks processing of named entities
• Phase 1: pre-processing
• Phase 2: gazetteer + pattern-matching
• Phase 3: merging + validation

MB phase 1: pre-processing (PRE)
• extract text from PDF with iText (https://itextpdf.com/en) => text without formatting
• correct diacritics and special characters
• eliminate end-of-line separators and other remains of the original format
• linguistic processing: tokenisation + lemmatisation, POS tagging, NP and sentence chunking

MB phase 2: gazetteer-applier (GAZ-APP)
• Gazetteers (GAZ): lists of toponyms and other geographic names, grouped by categories, used to identify entity candidates => a document containing annotations for the surface names found in GAZ:
  – type, subtype, coordinates, and other related geographical data are added
  – where ambiguous, a name receives multiple tags, one for each category/subcategory, and the disambiguation process is postponed

Gazetteer
1. LOCATION (with 23 subtypes): all locations usually referenced on the map of a region: cities, ports, streets, etc.;
2. GEO POSITION (with 6 subtypes): map references (parallel, meridian, cardinal point);
3. GEOLOGY (with 6 subtypes): geological formations visible on a specific map;
4. LANDFORM (with 16 subtypes): types of physiographic formations usually indicated on maps (mountain, valley, cave, etc.);
5. CLIME (with 5 subtypes): meteorological data shown on some types of maps;
6. WATER (with 11 subtypes): varieties of surface aquatic formations (river, lake, strait, etc.);
7. DIMENSION (with 9 subtypes): various ways in which geographical entities can be accompanied in text by (exact or approximate) values (height, depth, surface, etc.);
8. PERSON: names of people, accompanied by professions, where specified;
9. ORGANISATION (with 5 subtypes): military, education, etc., indicating also possible locations associated with a particular organisation type;
10. URL: web references;
11. TIMEX: dates, moments of time, intervals, etc.;
12. RESOURCE (with 4 subtypes): natural resources associated with locations;
13. INDUSTRY (with 4 subtypes): industrial areas (factories, electrical plants, etc.);
14. CULTURAL (with 6 subtypes): cultural areas (museums, parks, etc.);
15. UNKNOWN: for other geographical entities not covered by the above types

Geonames (http://www.geonames.org/)
• Open resource:
  – 2.8 million entities, with 5.5 million alternative names
  – For Romania: 25,951 names, with over 45,000 alternative names
• Name types identified by letters:
  – A (country, state, region, …), H (stream, lake, …), L (parks, area, …), P (city, village, …), R (road, railroad, …), S (spot, building, farm, …), T (mountain, hill, rock, …), U (undersea, …), V (forest, heath, …)
• For each of these types, besides geographical coordinates, values for specific attributes:
  – population (for P), surface (for A, H, L), height (for T), depth (for U), etc.

MB phase 2: pattern-matching (PAT-APP)
• Uses a set of patterns, described in terms of the markings left in the document by the PRE module, to discover potential geographical entities
• The difference between PAT-APP and GAZ-APP is that GAZ-APP makes use strictly of proper names, while the patterns of PAT-APP also include contextual words that appear in their vicinity and are used to reduce ambiguities

PAT-APP
• Takes as input a sequence of XML elements and a GGS network and tries to find a path in the network from its starting node to its ending node

MB phase 3: merging and validation
• Compares the two annotated files to take the final decision on every marking

Decisions of the MER module
• both GAZ-APP and PAT-APP annotate the same text span and the tag left by PAT-APP is among those left by GAZ-APP ⇒ the common tag is copied to the output file;
• both GAZ-APP and PAT-APP annotate the same text span and the tag left by PAT-APP is not among those left by GAZ-APP ⇒ the PAT-APP tag is copied to the output file;
• the text span annotated by GAZ-APP is included in the one annotated by PAT-APP and the tag left by PAT-APP is among those left by GAZ-APP ⇒ the common tag is copied onto the larger text span in the output file;
• the text span annotated by GAZ-APP is included in the one annotated by PAT-APP and the tag left by PAT-APP is not among those left by GAZ-APP ⇒ the PAT-APP tag is copied onto the larger text span in the output file;

Decisions of the MER module
• there is an intersection between the text spans annotated by the two modules and the tag left by PAT-APP is among those left by GAZ-APP ⇒ the common tag is copied onto the union of the text spans in the output file;
• there is an intersection between the text spans annotated by the two modules and the tag left by PAT-APP is not among those left by GAZ-APP ⇒ the PAT-APP tag is copied onto the union of the text spans in the output file;
• only one of the two modules annotates a certain text span, with one or more tags ⇒ one of those tags is chosen at random for that text span in the output file
More credibility is given to the PAT-APP module than to the GAZ-APP module, on the basis that it uses the context to disambiguate names
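
The MER decisions listed above can be sketched in a few lines of Python. This is only an illustrative sketch, not the MappingBooks implementation: the names Annotation and merge are hypothetical, spans are assumed to be token offsets with an exclusive end, and PAT-APP is assumed to leave a single tag (stored as tags[0]). Note that whenever both modules annotate overlapping spans, every rule on the slides yields the PAT-APP tag on the union of the two spans (a "common tag" is by definition the PAT-APP tag, and for identical or included spans the union is the larger span), so the overlapping cases collapse into one branch.

```python
import random
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass(frozen=True)
class Annotation:
    """A text span (token offsets, end exclusive) with its candidate tags."""
    start: int
    end: int
    tags: Tuple[str, ...]


def merge(gaz: Optional[Annotation], pat: Optional[Annotation],
          rng: random.Random = random.Random(0)) -> Tuple[int, int, str]:
    """Return (start, end, tag) for one GAZ-APP / PAT-APP annotation pair."""
    if gaz is None and pat is None:
        raise ValueError("at least one module must annotate the span")
    # Only one module annotated the span: choose one of its tags at random.
    if gaz is None or pat is None:
        only = gaz if gaz is not None else pat
        return only.start, only.end, rng.choice(only.tags)
    if not (gaz.start < pat.end and pat.start < gaz.end):
        raise ValueError("spans do not overlap; merge them separately")
    # Both modules annotated overlapping spans (identical, included or
    # intersecting): PAT-APP is trusted more, so its tag is kept, and it is
    # placed on the union of the two spans.
    start = min(gaz.start, pat.start)
    end = max(gaz.end, pat.end)
    return start, end, pat.tags[0]  # assumption: PAT-APP leaves one tag


# GAZ-APP span included in the PAT-APP span, tags disagree.
gaz = Annotation(3, 4, ("LOCATION", "PERSON"))
pat = Annotation(2, 4, ("WATER",))
print(merge(gaz, pat))  # (2, 4, 'WATER')
```

Collapsing the six overlapping rules into a single branch makes the stated preference for PAT-APP explicit; a fuller version would probably also record whether the two modules agreed, since agreement is a natural confidence signal for the validation step.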